Practical Applications of Locality Sensitive Hashing for Unstructured Data

Not recorded
Abstract

Working with large amounts of unstructured data (e.g., text documents) has become important for many business, engineering, and scientific applications. The purpose of this article is to demonstrate how a practicing data scientist can implement a Locality Sensitive Hashing (LSH) system from start to finish in order to drastically reduce the time required to perform a similarity search in a high-dimensional space (e.g., the space created by the terms in the vector space model for documents). LSH reduces the amount of data required for storage and comparison by applying probabilistic dimensionality reduction. In this paper we concentrate on the implementation of min-wise independent permutations (MinHashing), which provide an efficient way to compute an accurate approximation of the Jaccard similarity coefficient between sets (e.g., sets of terms in documents) [2,3].
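To make the MinHashing idea in the abstract concrete, the following is a minimal sketch, not the article's own implementation. It assumes documents are represented as non-empty sets of terms and approximates random permutations with linear hash functions of the form (a*x + b) mod p, a common simplification. All function names (make_hash_params, minhash_signature, estimate_jaccard, exact_jaccard) and the parameter choices are hypothetical and chosen only for illustration.

```python
import random
import zlib

# Assumption: a large Mersenne prime as the modulus for the linear hashes.
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) coefficient pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def term_id(term):
    """Map a term to a stable integer (CRC32 is deterministic across runs)."""
    return zlib.crc32(term.encode("utf-8"))


def minhash_signature(terms, params):
    """Return the MinHash signature of a non-empty set of terms.

    For each hash function, keep the minimum hash value over the set.
    """
    ids = [term_id(t) for t in terms]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity |A & B| / |A | B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = {f"term{i}" for i in range(0, 100)}
    doc_b = {f"term{i}" for i in range(50, 150)}
    params = make_hash_params(num_hashes=256)
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact:   ", exact_jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Each document is reduced to a fixed-length signature (256 integers here) regardless of how many terms it contains, which is where the storage and comparison savings come from. The estimate's accuracy improves as the number of hash functions grows, at the cost of longer signatures.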

Similar resources

Locality sensitive hashing: A comparison of hash function types and querying mechanisms

It is well known that high-dimensional nearest-neighbor retrieval is very expensive. Dramatic performance gains are obtained using approximate search schemes, such as the popular Locality-Sensitive Hashing (LSH). Several extensions have been proposed to address the limitations of this algorithm, in particular, by choosing more appropriate hash functions to better partition the vector space. All...

LV Barcoding: locality sensitive hashing-based tool for rapid species identification in DNA barcoding

DNA barcoding has emerged as a cost-effective approach for species identification. However, the scarcity of tools used for searching the booming reference database becomes an obstacle, currently with BLAST as the only practical choice. Here, we propose a program LV Barcoding based on both the random hyperplane projection-based locality sensitive hashing method and the composition vector-based V...

Multi-Level Spherical Locality Sensitive Hashing For Approximate Near Neighbors

This paper introduces “Multi-Level Spherical LSH”: a parameter-free, multi-level, data-dependent Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure is a modified version of a multi-probe adaptive querying algorithm, with the potential of achieving an O(np + t) query run time, for all inputs n where t <= n. Keywords—Locality Sensitiv...

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

Similarity search is critical for many database applications, including the increasingly popular online services for Content-Based Multimedia Retrieval (CBMR). These services, which include image search engines, must handle an overwhelming volume of data while keeping response times low. Thus, scalability is imperative for similarity search in Web-scale applications, but most existing methods a...

APT: Approximate Period Detection in Time Series

Period detection in time series is an important problem with many real-world applications, such as weather forecasting, stock market prediction, electrocardiogram analysis, and the detection of periodic disease outbreaks. In this work, we present a novel approximate period detection method for time series. The simplicity of our algorithm and its adaptability for high dimensional datasets using renowned tools and tech...

Publication date: 2014